Introduction:

Within this analysis, we’ll investigate factors and

knitr::opts_chunk$set(echo = FALSE, message = FALSE)

library(tidyverse) ## Loaded for dplyr
library(ggplot2) ## Loaded for plotting
library(plotly) ## Loaded for interactive plots
library(readr) ## Loaded to read in data
library(knitr) ## Loaded to compute and display data
library(scales) ## Loaded to scale data 

Diabetes Dataset

100,000 × 9 (first 6 rows)
gender  age  hypertension  heart_disease  smoking_history  bmi    HbA1c_level  blood_glucose_level  diabetes
Female  80   0             1              never            25.19  6.6          140                  0
Female  54   0             0              No Info          27.32  6.6          80                   0
Male    28   0             0              never            27.32  5.7          158                  0
Female  36   0             0              current          23.45  5.0          155                  0
Male    76   1             1              current          20.14  4.8          155                  0
Female  20   0             0              never            27.32  6.6          85                   0

Male vs. Female Blood Sugar Levels (HbA1c) Plot

99,982 × 10 (first 6 rows)
gender  age  hypertension  heart_disease  smoking_history  bmi    HbA1c_level  blood_glucose_level  diabetes  HbA1c_category
Female  80   0             1              never            25.19  6.6          140                  0         Diabetes ≥ 6.5%
Female  54   0             0              No Info          27.32  6.6          80                   0         Diabetes ≥ 6.5%
Male    28   0             0              never            27.32  5.7          158                  0         Prediabetes 5.7% - 6.4%
Female  36   0             0              current          23.45  5.0          155                  0         Normal < 5.7%
Male    76   1             1              current          20.14  4.8          155                  0         Normal < 5.7%
Female  20   0             0              never            27.32  6.6          85                   0         Diabetes ≥ 6.5%
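
The HbA1c_category column in the table above could be derived along the following lines. This is a minimal sketch, not the author's code: the cut points are read off the category labels, and `toy` is a hypothetical stand-in holding only the six rows shown. The drop from 100,000 to 99,982 rows is not explained in the text, so it is not reproduced here.

```r
library(dplyr)

# Hypothetical stand-in for the full dataset: the first six rows shown above.
toy <- tibble(
  gender      = c("Female", "Female", "Male", "Female", "Male", "Female"),
  HbA1c_level = c(6.6, 6.6, 5.7, 5.0, 4.8, 6.6)
)

# Bin HbA1c into three categories. The thresholds are an assumption taken
# from the category labels; case_when() uses the first condition that matches.
toy_cat <- toy %>%
  mutate(
    HbA1c_category = case_when(
      HbA1c_level < 5.7  ~ "Normal < 5.7%",
      HbA1c_level < 6.5  ~ "Prediabetes 5.7% - 6.4%",
      HbA1c_level >= 6.5 ~ "Diabetes ≥ 6.5%"
    )
  )

print(toy_cat)
```

Because `case_when()` evaluates conditions in order, the second branch only sees values of 5.7 or above, so the three bins do not overlap.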

Similar Prevalence of Prediabetes – The proportion of individuals with prediabetes (HbA1c 5.7% - 6.4%) is almost identical for males (41.3%) and females (41.4%), so prediabetes appears to affect both genders at nearly the same rate in this dataset.

Females Have a Slightly Higher Proportion of Normal Blood Sugar Levels – 38.4% of females fall into the normal category (HbA1c < 5.7%) compared with 37.1% of males. The gap is small (1.3 percentage points), so any explanation in terms of protective factors or lifestyle differences is speculative.

Males Are Slightly More Likely to Be in the Diabetes Category – By subtraction from the figures above, roughly 21.6% of males versus 20.2% of females have an HbA1c of 6.5% or higher. Gender-related risk factors such as diet, activity levels, or genetic predisposition could be worth exploring.

Overall, blood sugar patterns appear fairly balanced between genders, and the small differences point to areas for further investigation rather than firm conclusions.
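
The within-gender percentages quoted above can be computed with a grouped count. The sketch below is an assumption about how such shares might be obtained; `category_share_by_gender` is a hypothetical helper, and the toy data is made up purely to show the shape of the result.

```r
library(dplyr)

# Hypothetical helper: share of each HbA1c category within each gender.
category_share_by_gender <- function(df) {
  df %>%
    count(gender, HbA1c_category, name = "n") %>%
    group_by(gender) %>%
    mutate(share = n / sum(n)) %>%  # shares sum to 1 within each gender
    ungroup()
}

# Made-up rows, not taken from the diabetes dataset.
toy <- tibble(
  gender         = c("Female", "Female", "Male", "Male", "Male", "Male"),
  HbA1c_category = c("Normal", "Prediabetes",
                     "Normal", "Normal", "Diabetes", "Prediabetes")
)

shares <- category_share_by_gender(toy)
print(shares)
```

Dividing by the per-gender total, rather than the overall total, is what makes the male and female percentages directly comparable even when the two groups differ in size.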


BMI Distribution by Hypertension Status Plot

This plot shows the distribution of BMI values by hypertension status. A violin plot is well suited to this because it shows both the spread and the density of BMI within each hypertension category.

Shape and width: The width of each “violin” represents the density of BMI values at different levels. Wider sections mean more individuals have that BMI, while narrower sections indicate fewer people at those values.

Comparison of distributions: The blue violin represents people without hypertension (hypertension = 0), while the red violin represents those with hypertension (hypertension = 1). By comparing them, you can see how BMI differs between these groups.

The horizontal line around a BMI of 25: This marks the median BMI for each group. Since the line sits in roughly the same position in both violins, the median BMI is around 25 for both hypertensive and non-hypertensive individuals.

Density trends: Where the violins differ in thickness over a BMI range, those BMI values are more or less common in one group than in the other. The medians are similar, but the hypertension violin appears to carry more of its density at higher BMI values.

Distribution shape: If one violin is wider at higher BMI values, it means higher BMI values are more common within that group. It does not by itself show how common hypertension is at a given BMI.

Outliers or extreme values might appear as small bulges or extended tails at the ends of the violins, showing individuals with very high or low BMI.
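
A violin plot of this kind could be built roughly as follows. This is a sketch under stated assumptions, not the plot code used for the report: the BMI values are simulated, the colours are chosen to match the description above, and a narrow box plot is overlaid to mark each group's median, which is one common way to get the horizontal line.

```r
library(ggplot2)

# Simulated BMI values standing in for the real dataset (assumption).
set.seed(1)
toy <- data.frame(
  hypertension = factor(rep(c(0, 1), each = 200),
                        levels = c(0, 1), labels = c("No", "Yes")),
  bmi = c(rnorm(200, mean = 26, sd = 5), rnorm(200, mean = 29, sd = 5))
)

# Per-group medians as numbers, to read alongside the plot.
medians <- aggregate(bmi ~ hypertension, data = toy, FUN = median)
print(medians)

p <- ggplot(toy, aes(x = hypertension, y = bmi, fill = hypertension)) +
  geom_violin(alpha = 0.7) +
  # Narrow box plot: its centre line marks the median of each group.
  geom_boxplot(width = 0.1, fill = "white", outlier.shape = NA) +
  scale_fill_manual(values = c(No = "steelblue", Yes = "firebrick")) +
  labs(title = "BMI Distribution by Hypertension Status",
       x = "Hypertension", y = "BMI") +
  theme_light()

print(p)
```

Printing the medians next to the plot makes it easier to check claims such as "both groups share a similar median" without reading values off the axis.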

BMI vs. Age Across Diabetes & Heart Disease Plot

Smoking History

The smoking_history column has 6 unique values:

  1. Never: Has never smoked
  2. Not current: Has smoked but is not currently smoking
  3. Former: Has quit smoking (abstained for longer than)
  4. Current: Is currently a smoker
  5. Ever: Has ever smoked regardless of current smoking status
  6. No Info: No smoking history information available

The number of people in each category is as follows:

  1. Never: 35095
  2. Not current: 6447
  3. Former: 9352
  4. Current: 9286
  5. Ever: 4004
  6. No Info: 35816

A sizable share of people fall into the No Info category: 35,816 of 100,000, or about 35.8%.

The dataset contains 100,000 people in total. To clean up the data, we can filter out the 'No Info' rows, which leaves 64,184 people with a known smoking history.

## [1] "never"       "No Info"     "current"     "former"      "ever"       
## [6] "not current"
## # A tibble: 6 × 2
##   smoking_history total_people
##   <chr>                  <int>
## 1 No Info                35816
## 2 current                 9286
## 3 ever                    4004
## 4 former                  9352
## 5 never                  35095
## 6 not current             6447
## `summarise()` has grouped output by 'smoking_history'. You can override using
## the `.groups` argument.
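
The effect of dropping 'No Info' can be checked directly from the category totals printed above. The sketch below rebuilds that summary table by hand; `smoking_counts` is a hypothetical name, and on the raw data the equivalent step would be a `filter(smoking_history != "No Info")` on `diabetes_dataset`.

```r
library(dplyr)

# Category totals copied from the summary output above.
smoking_counts <- tibble(
  smoking_history = c("No Info", "current", "ever", "former", "never", "not current"),
  total_people    = c(35816, 9286, 4004, 9352, 35095, 6447)
)

# Keep only the categories with a known smoking history.
known <- smoking_counts %>%
  filter(smoking_history != "No Info")

total_all     <- sum(smoking_counts$total_people)  # 100000
total_known   <- sum(known$total_people)           # 64184
share_no_info <- 1 - total_known / total_all       # 0.35816

print(c(total_all = total_all, total_known = total_known))
print(share_no_info)
```

Dropping more than a third of the rows is a substantial change, so it may be worth checking whether the 'No Info' group differs from the rest on other variables before excluding it.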

Now we can graph the relationship between

library(dplyr)
library(ggplot2)
library(scales)

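# Share of people with and without hypertension inside each HbA1c category
df_summary <- diabetes_dataset %>%
  mutate(
    # Bin HbA1c into non-diabetic, prediabetic and diabetic ranges
    HbA1c_cat = case_when(
      HbA1c_level < 5.7                      ~ "< 5.7 (non‑diabetic)",
      HbA1c_level >= 5.7 & HbA1c_level < 6.5 ~ "5.7–6.4 (prediabetic)",
      HbA1c_level >= 6.5                     ~ "≥ 6.5 (diabetic)"
    ),
    # Recode the 0/1 flag as a labelled factor for the legend
    hypertension = factor(hypertension, levels = c(0, 1),
                          labels = c("No", "Yes"))
  ) %>%
  group_by(HbA1c_cat, hypertension) %>%
  summarise(n = n(), .groups = "drop_last") %>%  # result stays grouped by HbA1c_cat
  mutate(percent = n / sum(n)) %>%               # so percent is within each category
  ungroup()
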
# Stacked bars: each bar is one HbA1c category, split by hypertension status
ggplot(df_summary, aes(x = HbA1c_cat,
                       y = percent,
                       fill = hypertension)) +  # already a labelled factor
  geom_col(position = "stack") +
  geom_text(aes(label = scales::percent(percent, accuracy = 1)),
            position = position_stack(vjust = 0.5)) +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_fill_discrete(name = "Hypertension") +
  labs(
    title = "Hypertension Status by HbA1c Category",
    x     = "HbA1c Category",
    y     = "Percent within Category"
  ) +
  theme_light()